Interactive Exploration of Multivariate Categorical Data: Exploiting Ranking Criteria to Reveal Patterns and Outliers
Abstract
Analyzing multivariate datasets requires users to understand the distributions of single variables and at least the two-way relationships between variables. Lower-dimensional projection techniques can help users find interesting combinations. To explore the 2D relationships systematically, we suggest ranking them according to a measure of interestingness. This approach has been valuable for continuous data; however, metrics for categorical data are a novel contribution. We propose CateRank, a tool for analyzing categorical datasets that visualizes one-dimensional relationships as histograms and uses a re-orderable matrix for two-dimensional relationships. CateRank implements several metrics based on histogram and matrix properties that enable users to discover relationships between two categorical variables. User controls support data filtering to remove extreme or uninteresting values.
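The idea of ranking two-way relationships can be illustrated with a minimal sketch. The abstract does not name CateRank's metrics, so the example below substitutes Cramér's V, a standard chi-square-based association measure, as an assumed stand-in for "interestingness". The function names `cramers_v` and `rank_pairs` and the toy dataset are hypothetical and are not taken from the tool.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V association between two equal-length categorical sequences.

    Assumption: this is a stand-in metric, not one confirmed to be in CateRank.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # cells of the two-way contingency matrix
    row = Counter(xs)             # one-variable histogram of the first variable
    col = Counter(ys)             # one-variable histogram of the second variable
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return 0.0                # a constant variable carries no association
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def rank_pairs(table):
    """Score every pair of columns in `table` (dict: name -> list), highest first."""
    scores = [
        (cramers_v(table[a], table[b]), a, b)
        for a, b in combinations(sorted(table), 2)
    ]
    return sorted(scores, reverse=True)


# Hypothetical toy data: "color" determines "size"; "shape" is unrelated to both.
toy = {
    "color": ["red", "red", "blue", "blue", "red", "red", "blue", "blue"],
    "size":  ["S",   "S",   "L",    "L",    "S",   "S",   "L",    "L"],
    "shape": ["sq",  "ci",  "sq",   "ci",   "sq",  "ci",  "sq",   "ci"],
}

for score, a, b in rank_pairs(toy):
    print(f"{a} x {b}: {score:.2f}")
```

In this toy dataset the color and size pair ranks first because the two variables are perfectly associated, while both pairs involving shape score zero. A ranked list of this kind is what lets a user jump straight to the variable pairs most likely to be worth inspecting.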
Similar resources
Local multivariate outliers as geochemical anomaly halos indicators, a case study: Hamich area, Southern Khorasan, Iran
Anomaly recognition has always been a prominent subject in preliminary geochemical exploration. Regional geochemical data processing draws on a range of statistical and data mining techniques, as well as different mapping methods that serve to present the outputs. Outlier values are of interest in investigations where data are gathered under controlled condition...
Cycle Plot Revisited: Multivariate Outlier Detection Using a Distance-Based Abstraction
The cycle plot is an established and effective visualization technique for identifying and comprehending patterns in periodic time series, such as trends and seasonal cycles. It also allows users to visually identify and contextualize extreme values and outliers from a different perspective. Unfortunately, it is limited to univariate data. For multivariate time series, patterns that exist across several...
Identification of outliers types in multivariate time series using genetic algorithm
Multivariate time series data are often modeled using the vector autoregressive moving average (VARMA) model. However, the presence of outliers can violate the stationarity assumption and may lead to wrong modeling, biased parameter estimates, and inaccurate prediction. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...
Exploratory Data Analysis With Categorical Variables: An Improved Rank-by-Feature Framework and a Case Study
Multidimensional datasets often include categorical information. When most dimensions have categorical information, clustering the dataset as a whole can reveal interesting patterns in the dataset. However, the categorical information is often more useful as a way to partition the dataset: gene expression data for healthy vs. diseased samples or stock performance for common, preferred, or conve...
Maximum trimmed likelihood estimator for multivariate mixed continuous and categorical data
In this article we apply the maximum trimmed likelihood (MTL) approach (Hadi and Luceño 1997) to obtain the robust estimators of multivariate location and shape, especially for data mixed with continuous and categorical variables. The forward search algorithm (Atkinson 1994) is adapted to compute the proposed MTL estimates. A simulation study shows that the proposed estimator outperfor...